This report is created by Shubham Bishnoi (shubishnoi@gmail.com).

Table of contents


Objective

The objective of this report is to demonstrate the process of importing NFL data from https://www.kaggle.com/c/NFL-Punt-Analytics-Competition/data to the SQLite database, analyzing and visualizing the NFL game data, and suggesting rules to improve player safety during punt plays.


Tools and libraries used

  • R
    • dplyr
    • DBI
    • ggplot2
    • plotly
    • gapminder
  • Python
    • Spark

Creating NFL.db

I started by downloading the data set from the website and importing it in SQLite. I went through the data types of all the attributes in the tables and made changes accordingly. For example, GSISID was imported as real and changed to an integer in all of the tables. Since the tool must import and pull from the SQLite database directly, the connection to the existing NFL.db database is made:

setwd("/Users/shubhambishnoi/Desktop/Documents/Brookfield/Case Study/")
NFL <- dbConnect(RSQLite::SQLite(), "NFL.db")

Analyzing the NFL data

The data of all the punts (play_information) and punts that resulted in a player getting a concussion (video_review) was analyzed.

play_info <- tbl(NFL, "play_information")
play_info <- play_info %>%
  collect()

injuries <- tbl(NFL, "video_review")
injuries <- injuries %>%
  collect()

paste("Concussion rate in Punts play is ",
      round(NROW(injuries)/NROW(play_info),4) * 100,"%", sep = "")
## [1] "Concussion rate in Punts play is 0.55%"
paste("Average punts per game is ",
      round(NROW(play_info$GameKey)/NROW(unique(play_info$GameKey)),1), sep = "")
## [1] "Average punts per game is 10.1"
paste("Number of players injured in a season ",
      round(5.5/99 * 256,2), sep = "")
## [1] "Number of players injured in a season 14.22"

5.5 players on average get injured in 1000 punts and there are 10.1 punts on average in each game. Therefore, for every 99 games there are 5.5 players that get injured. Leading to 14.22 players getting injured over the regular season.

To figure out the most commonly used terms in the play description, I used Spark in Python to figure out top 50 words.

write.table(play_info$PlayDescription,"filename.txt",sep="\t",row.names=FALSE)
import findspark
findspark.init("/u/cs451/packages/spark")
from pyspark import SparkContext, SparkConf
import re
sc = SparkContext(appName="YourTest", master="local[2]")

# this captures sequences of alphabetic characters
def simple_tokenize(s):
    return re.findall(r"[a-z]+(?:'[a-z]+)?",s.lower())

# Returns a list of the top 50 (count, token) tuples
def top_50_tokens():
    text = sc.textFile("filename.txt")
    rdd1 = text.flatMap(lambda x: list(set(simple_tokenize(x))))
    rdd2 = rdd1.map(lambda x: (x,1)).reduceByKey(lambda x,y:x+y)
    rdd3 = rdd2.map(lambda x:(x[1],x[0])).sortByKey(False)
    return rdd3.take(50)

top_50_tokens()

These were the most common useful words:

[(‘yards’: 6643), (‘center’: 6489), (‘punts’: 6462), (‘punt’: 2151), (‘formation’: 2098), (‘catch’: 1880), (‘fair’: 1663), (‘penalty’: 1142), (‘enforced’: 1098), (‘downed’: 811), (‘of’: 776), (‘out’: 714), (‘bounds’: 708), (‘holding’: 451), (‘offensive’: 445), (‘pushed’: 435), (‘zone’: 428), (‘end’: 427), (‘touchback’: 407), (‘gain’: 380), (‘play’: 370), (‘illegal’: 323)]

It is understandable that play description refers to ‘yards’ and ‘punts’ prominently. ‘Fair catch’, ‘penalty’, ‘downed’, ‘out of bounds’, and ‘touchback’ is mentioned a lot.

play_info$NewKey <- paste(as.character(play_info$GameKey),"-",as.character(play_info$PlayID))
injuries$NewKey <- paste(as.character(injuries$GameKey),"-",as.character(injuries$PlayID))
newConKey <- c(injuries$NewKey)
Outcomes <- c("Downed","Muff","Touchback","Out Of Bounds","Fair Catch","Penalty","Return")
Total <- c(0,0,0,0,0,0,0)
Concussion <- c(0,0,0,0,0,0,0)
myData <- data.frame(Outcomes,Total,Concussion)
for (i in 1:NROW(play_info)){
  temp <- play_info$PlayDescription[i]
  temp2 <- (c(strsplit(temp, "[,. ]+")))[[1]]
  temp3 <- (play_info$NewKey)[i]
  if ("downed" %in% temp2){
    if (temp3 %in% newConKey){
      myData[1,3] <- myData[1,3] + 1
    }
    myData[1,2] <- myData[1,2] + 1
  }
  if ("MUFFS" %in% temp2){
    if (temp3 %in% newConKey){
      myData[2,3] <- myData[2,3] + 1
    }
    myData[2,2] <- myData[2,2] + 1
  } 
  if ("Touchback" %in% temp2){
    if (temp3 %in% newConKey){
      myData[3,3] <- myData[3,3] + 1
    }
    myData[3,2] <- myData[3,2] + 1
  }  
  if ("out" %in% temp2 & "bounds" %in% temp2 & "of" %in% temp2){
    if (temp3 %in% newConKey){
      myData[4,3] <- myData[4,3] + 1
    }
    myData[4,2] <- myData[4,2] + 1
  } 
  if ("fair" %in% temp2 & "catch" %in% temp2){
    if (temp3 %in% newConKey){
      myData[5,3] <- myData[5,3] + 1
    }
    myData[5,2] <- myData[5,2] + 1
  } 
  if ("PENALTY" %in% temp2){
    if (temp3 %in% newConKey){
      myData[6,3] <- myData[6,3] + 1
    }
    myData[6,2] <- myData[6,2] + 1
  } 
  if ("for" %in% temp2){
    if (temp3 %in% newConKey){
      myData[7,3] <- myData[7,3] + 1
    }
    myData[7,2] <- myData[7,2] + 1
  }
}
print(myData)
##        Outcomes Total Concussion
## 1        Downed   811          3
## 2          Muff   203          2
## 3     Touchback   407          0
## 4 Out Of Bounds   669          0
## 5    Fair Catch  1659          2
## 6       Penalty  1077         10
## 7        Return  2977         32
ggplot(myData, aes(Outcomes, Concussion/Total * 1000, colour = Outcomes)) + 
  geom_point() + labs(y = "Injuries per 1000 Events")

The most number of concussions are related to ‘return’, ‘muff’, and ‘penalty’.

Next, the activities performed causing the concussion were explored.

activity <- unique(injuries$Player_Activity_Derived)
freq <- c(0,0,0,0)
myData2 <- data.frame(activity,freq)
for (i in 1:NROW(myData2)){
  myData2[i,2] <- NROW(injuries[injuries$Player_Activity_Derived == myData2[i,1],])
}
print(myData2)
##   activity freq
## 1 Tackling   13
## 2  Blocked   10
## 3 Blocking    8
## 4  Tackled    6
ggplot(myData2, aes(activity, freq, colour = activity)) + 
  geom_point()

Players get more concussions while tackling or blocking instead of getting blocked or tackled.

Next, I thought it will be interesting to visualize a sample game, specifically at the time a player gets a concussion:

N16Pre <- tbl(NFL, "NGS-2016-pre")

N16Pre <- N16Pre %>%
  select(`GameKey`,`PlayID`, `GSISID`,`Time`,`x`,`y`,`o`,`dir`,`Event`) %>% 
  filter(`GameKey` == 21) %>% 
  arrange(`Time`)

N16PreCon <- N16Pre %>%
  filter(`GSISID` == 29343) %>%
  select(`Time`) %>%
  collect()

start <- N16PreCon$Time[1]
end <- N16PreCon$Time[NROW(N16PreCon)]
N16PreTotal <- N16Pre %>%
  filter(`Time` >= start) %>% 
  filter(`Time` <= end) %>% 
  collect()

N16PreTotal$role <- "Player"
N16PreTotal$role[N16PreTotal$GSISID == 29343] <- "Player got concussion"
#N16PreTotal$role <- as.factor(N16PreTotal$role)
p <- N16PreTotal %>%
  plot_ly(
    x = ~x, 
    y = ~y, 
    frame = ~Time, 
    color = ~role,
    type = 'scatter',
    mode = 'markers',
    colors = "Dark2",
    xlim = c(0,120),
    ylim = c(0,53.3)
  )
p

We can see the movement of the player until the injury takes place.

The punts data was further analyzed to see the variation in the number of punts in different games.

myData3 <- data.frame(play_info$GameKey)
myData3$freq <- 1
agg = aggregate(myData3,
                by = list(myData3$play_info.GameKey),
                FUN = sum)

boxplot(agg$freq, ylab = "Number of punts", main = " Boxplot for number of punts in NFL games of 2016 and 2017")

The number of injuries were analyzed in three groups of punts per game; 0-6 punts, 7-13 punts and 14+ punts per game.

gameInj <- c(injuries$GameKey)
agg$injury <- 0
for (i in 1:NROW(agg)){
  agg$injury[i] <- NROW(injuries$GameKey[injuries$GameKey == agg$Group.1[i]])
}
agg <- agg[,-c(2)]
Groups <- c("0-6 punts","7-13 punts","14+ punts")
PuntInjury <- c(0,0,0)
GameTotal <- c(NROW(agg$Group.1[agg$freq <= 6]), NROW(agg$Group.1[agg$freq > 6 & agg$freq <= 13]), NROW(agg$Group.1[agg$freq >= 14]))
myData4 <- data.frame(Groups,PuntInjury,GameTotal)
for (i in 1:NROW(agg)){
  if (agg$freq[i] <= 6){
    myData4$PuntInjury[1] = myData4$PuntInjury[1] + agg$injury[i]
  } else if (agg$freq[i] <= 13){
    myData4$PuntInjury[2] = myData4$PuntInjury[2] + agg$injury[i]
  } else {
    myData4$PuntInjury[3] = myData4$PuntInjury[3] + agg$injury[i]
  }
}
print(myData4)
##       Groups PuntInjury GameTotal
## 1  0-6 punts          1       108
## 2 7-13 punts         25       438
## 3  14+ punts         11       116

If there were less than or equal to 6 punts per game, there was only 1 injury in 108 games. Therefore I feel if the number of punts are restricted each game to say 3 each team then there will be less injuries. The number of injured players will go down from 14.22 to 2.37.

dbDisconnect(NFL)

Conclusion

For the purpose of this report and providing a recommendation, I researched the rules of the NFL, how the game is played and different terms that are used in it. Interestingly, there was a NFL game between Kansas City Chiefs and Baltimore Ravens yesterday, so it was nice to watch the game and enjoy the technicalities of the sport.

I analyzed various tables of the NFL data set and came up with a rule: Players should be allowed a maximum of 3 number of punts throughout a game.

This will decrease the number of injuries from 14.22 to 2.37 on average in the season.